- Feed-forward neural net
- Recurrent neural net
- SRN
- LSTM
- GRU
- Bi-LSTM
A machine learning subfield focused on learning representations of data. Exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers
If you provide the system with tons of information, it begins to understand it and respond in useful ways.
\[h = \sigma(W_1x + b_1)\]
\[y = \sigma(W_2h + b_2)\]
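The two-layer forward pass above can be sketched in NumPy; the layer sizes (3 inputs, 4 hidden, 2 output units) and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input vector (3 features, assumed)
W1 = rng.normal(size=(4, 3))      # hidden layer: 4 units
b1 = np.zeros(4)
W2 = rng.normal(size=(2, 4))      # output layer: 2 units
b2 = np.zeros(2)

h = sigmoid(W1 @ x + b1)          # h = sigma(W1 x + b1)
y = sigmoid(W2 @ h + b2)          # y = sigma(W2 h + b2)
print(h.shape, y.shape)           # (4,) (2,)
```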
4 + 2 = 6 neurons (not counting inputs)
[3 x 4] + [4 x 2] = 20 weights
4 + 2 = 6 biases
26 learnable parameters
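The count above (for the assumed 3-4-2 network) can be checked programmatically:

```python
def count_parameters(layer_sizes):
    """Count weights and biases for a fully connected net.
    layer_sizes includes the input layer, e.g. [3, 4, 2]."""
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # input units have no bias
    return weights, biases

w, b = count_parameters([3, 4, 2])
print(w, b, w + b)  # 20 6 26
```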
Optimize (min. or max.)
objective/cost function \(J(\theta)\)
Generate
error signal that measures difference between predictions and target values
Use error signal to change the
weights and get more accurate predictions
Subtracting a fraction of the gradient moves you towards the (local) minimum of the cost function
Define objective to minimize error: \[E(W) = \frac{1}{2}\sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2\]
where \(D\) is the set of training examples, \(K\) is the set of output units, \(t_{kd}\) and \(o_{kd}\) are, respectively, the teacher and current output for unit \(k\) for example \(d\).
The derivative of a sigmoid unit with respect to net input is: \[\frac{\partial{o_j}}{\partial{net_j}} = o_j(1-o_j)\]
Learning rule to change weights so as to minimize error: \[\Delta w_{ji} = - \eta \frac{\partial{E}}{\partial{w_{ji}}}\]
Each weight changed by: \[\Delta w_{ji} = \eta \delta_j o_i \] \[\delta_j = o_j(1 - o_j)(t_j - o_j) \ \ \ \text{if } j \text{ is an output unit}\] \[\delta_j = o_j(1 - o_j)\sum_k{\delta_k w_{kj}} \ \ \ \text{if } j \text{ is a hidden unit}\]
where \(\eta\) is a constant called the learning rate
\(t_j\) is the correct teacher output for unit \(j\)
\(\delta_j\) is the error measure for unit \(j\)
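The two delta rules and the weight update can be sketched for a single training example; the 3-4-2 sigmoid network and random initial values are assumptions carried over from the earlier parameter-counting example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.5                                   # learning rate
rng = np.random.default_rng(1)
x = rng.normal(size=3)                      # one training example
t = np.array([1.0, 0.0])                    # teacher output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass
h = sigmoid(W1 @ x + b1)
o = sigmoid(W2 @ h + b2)

# delta_j = o_j (1 - o_j)(t_j - o_j)          for output units
delta_out = o * (1 - o) * (t - o)
# delta_j = o_j (1 - o_j) sum_k delta_k w_kj  for hidden units
delta_hid = h * (1 - h) * (W2.T @ delta_out)

# Delta w_ji = eta * delta_j * o_i (biases treated as weights from o_i = 1)
W2 += eta * np.outer(delta_out, h); b2 += eta * delta_out
W1 += eta * np.outer(delta_hid, x); b1 += eta * delta_hid
```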
objective/cost function \(J(\theta)\)
Update each element of \(\theta\):
\[\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial\theta^{old}_j} J(\theta)\]
Matrix notation for all parameters ( \(\alpha\): learning rate):
\[\theta^{new} = \theta^{old} - \alpha \nabla_{\theta}J(\theta)\]
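As a minimal illustration of this update rule, minimizing \(J(\theta)=\theta^2\) (whose gradient is \(2\theta\)):

```python
theta = 5.0                       # initial parameter
alpha = 0.1                       # learning rate
for _ in range(100):
    grad = 2 * theta              # dJ/dtheta for J(theta) = theta^2
    theta = theta - alpha * grad  # theta_new = theta_old - alpha * grad
print(theta)                      # close to the minimum at 0
```

Each step subtracts a fraction of the gradient, so \(\theta\) shrinks toward the minimum at 0.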
Recursively apply the chain rule through each node
Current output: \(o_j=0.2\)
Correct output: \(t_j=1.0\)
Error \(\delta_j = o_j(1-o_j)(t_j-o_j)\)
\(0.2(1-0.2)(1-0.2)=0.128\)
Update weights into \(j\)
\[\Delta w_{ji} = \eta \delta_j o_i\]
\[\delta_j = o_j(1-o_j)\sum_k{\delta_kw_{kj}}\]
Update weights into \(j\)
\[\Delta w_{ji} = \eta \delta_j o_i\]
Create the 3-layer network with \(H\) hidden units with full connectivity between layers. Set weights to small random real values.
Until all training examples produce the correct value (within \(\epsilon\)), or mean squared error ceases to decrease, or other termination criteria:
Begin epoch
For each training example, \(d\), do:
- Propagate the input forward to compute the outputs
- Compute the \(\delta\) error terms for output and hidden units
- Update each weight: \(\Delta w_{ji} = \eta \delta_j o_i\)
End epoch
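The training procedure above can be sketched end-to-end in NumPy; the defaults (\(H=4\), \(\eta=0.5\), \(\epsilon=0.05\), an epoch cap, and XOR as the demo task) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, T, H=4, eta=0.5, epsilon=0.05, max_epochs=10000, seed=0):
    """Train a 3-layer sigmoid network with backpropagation.
    Stops when every output is within epsilon of its target,
    or when max_epochs is reached."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W1 = rng.uniform(-0.1, 0.1, size=(H, n_in))   # small random weights
    b1 = np.zeros(H)
    W2 = rng.uniform(-0.1, 0.1, size=(n_out, H))
    b2 = np.zeros(n_out)
    for _ in range(max_epochs):                   # begin epoch
        for x, t in zip(X, T):                    # for each training example d
            h = sigmoid(W1 @ x + b1)              # forward pass
            o = sigmoid(W2 @ h + b2)
            d_out = o * (1 - o) * (t - o)         # output-unit deltas
            d_hid = h * (1 - h) * (W2.T @ d_out)  # hidden-unit deltas
            W2 += eta * np.outer(d_out, h); b2 += eta * d_out
            W1 += eta * np.outer(d_hid, x); b1 += eta * d_hid
        O = sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2[:, None]).T
        if np.all(np.abs(T - O) < epsilon):       # termination criterion
            break                                 # end epoch
    return W1, b1, W2, b2

# Usage: XOR, a classic task that needs the hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train(X, T)
```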
Non-linearities needed to learn complex (non-linear) representations of data, otherwise the NN would be just a linear function \(W_1W_2x = Wx\)
More layers and neurons can approximate more complex functions
Full list: https://en.wikipedia.org/wiki/Activation_function
Takes a real-valued number and “squashes” it into range between 0 and 1. \[R^n \rightarrow [0,1]\]
Takes a real-valued number and “squashes” it into range between -1 and 1. \[R^n \rightarrow [-1,1]\]
Most Deep Networks use ReLU nowadays
Takes a real-valued number and thresholds it at zero: \(f(x) = \max(0,x)\) \[R^n \rightarrow R^n_+\]
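The three activations above can be sketched together in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # thresholds at zero

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))                        # [0. 0. 2.]
```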
Learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data)
\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]
\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] \[\tilde{C}_t = tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]
\[C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\]
\[o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\] \[h_t = o_t * tanh(C_t)\]
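One LSTM step following the four equations above can be sketched in NumPy; the sizes (hidden 3, input 2) and random parameters are illustrative assumptions, and `*` is elementwise multiplication as in the equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate cell values
    C_t = f_t * C_prev + i_t * C_tilde     # new cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(C_t)               # new hidden state
    return h_t, C_t

# toy sizes (assumed): hidden H = 3, input D = 2
rng = np.random.default_rng(0)
H, D = 3, 2
def make():
    return rng.normal(size=(H, H + D)), np.zeros(H)
(W_f, b_f), (W_i, b_i), (W_C, b_C), (W_o, b_o) = make(), make(), make(), make()
h, C = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o)
```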
POS Tagging
https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
NER
Phrase Chunking
\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\] \[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\] \[\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])\] \[h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t\]
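A GRU step following these equations can be sketched the same way (sizes and random parameters are again assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    z_in = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ z_in)                         # update gate
    r_t = sigmoid(W_r @ z_in)                         # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))
    h_t = (1 - z_t) * h_prev + z_t * h_tilde          # interpolate old/new
    return h_t

rng = np.random.default_rng(0)
H, D = 3, 2                        # toy sizes (assumed)
W_z, W_r, W = (rng.normal(size=(H, H + D)) for _ in range(3))
h = gru_step(rng.normal(size=D), np.zeros(H), W_z, W_r, W)
```

Unlike the LSTM, the GRU keeps a single state vector and merges the forget/input gates into one update gate \(z_t\).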
\[score(h_t,\bar{h}_s) = h_t^T\bar{h}_s\]
Compare target and source hidden states
\[a_t(s) = \frac{e^{score(h_t,\bar{h}_s)}}{\sum_{s'}{e^{score(h_t,\bar{h}_{s'})}}}\]
Convert into alignment weights
\[c_t = \sum_s{a_t(s)\bar{h}_s}\]
Build context vector: weighted average
\[h_t = f(h_{t-1}, c_t, e_t)\]
Compute next hidden state
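The score / alignment / context steps can be sketched with dot-product attention in NumPy (the source length 5 and dimension 4 are illustrative assumptions; the final hidden-state update \(f\) is model-specific and omitted):

```python
import numpy as np

def attention_context(h_t, H_s):
    """Dot-product attention: h_t is the target hidden state (d,),
    H_s holds the source hidden states as rows (n, d)."""
    scores = H_s @ h_t                         # score(h_t, h_s) = h_t^T h_s
    weights = np.exp(scores - scores.max())    # softmax, shifted for stability
    weights /= weights.sum()                   # alignment weights a_t(s)
    c_t = weights @ H_s                        # context: weighted average
    return weights, c_t

rng = np.random.default_rng(0)
H_s = rng.normal(size=(5, 4))      # 5 source states, dimension 4
h_t = rng.normal(size=4)
a, c = attention_context(h_t, H_s)
```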